Stochastic thermodynamics of learning
Unravelling the physical limits of information processing is an important goal of non-equilibrium statistical physics. It is motivated by the search for fundamental limits of computation, such as Landauer's bound on the minimal work required to erase one bit of information. Further inspiration comes from biology, where we would like to understand what makes single cells or the human brain so (energy-)efficient at processing information.
In this thesis, we analyse the thermodynamic efficiency of learning in neural networks. We first discuss the interplay of information processing and dissipation from the perspective of stochastic thermodynamics, a powerful framework for analysing the thermodynamics of strongly fluctuating systems far from equilibrium. We then show that the dissipation of any physical system, in particular a neural network, bounds the information that the network can infer from data or learn from a teacher. Along the way, we illustrate our thermodynamic bounds with a number of examples, and we outline directions for future research.
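For reference, here is Landauer's bound mentioned above, together with a purely schematic paraphrase of the kind of bound the thesis derives; the precise statement and its assumptions are in the thesis itself.

```latex
% Landauer's bound, cited above: erasing one bit of information at
% temperature T requires a minimal work of
W \;\ge\; k_B T \ln 2 .
% Purely schematic paraphrase of the result summarised above (not the
% thesis' exact statement): the total entropy production
% \Delta S_{\mathrm{tot}} of the network limits the information I
% (in bits) it can learn,
I \;\lesssim\; \frac{\Delta S_{\mathrm{tot}}}{k_B \ln 2} .
```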
Neural networks trained with SGD learn distributions of increasing complexity
The ability of deep neural networks to generalise well even when they
interpolate their training data has been explained using various "simplicity
biases". These theories postulate that neural networks avoid overfitting by
first learning simple functions, say a linear classifier, before learning more
complex, non-linear functions. Meanwhile, data structure is also recognised as
a key ingredient for good generalisation, yet its role in simplicity biases is
not yet understood. Here, we show that neural networks trained using stochastic
gradient descent initially classify their inputs using lower-order input
statistics, like mean and covariance, and exploit higher-order statistics only
later during training. We first demonstrate this distributional simplicity bias
(DSB) in a solvable model of a neural network trained on synthetic data. We
then empirically demonstrate DSB in a range of deep convolutional networks and
vision transformers trained on CIFAR10, and show that it even holds in networks
pre-trained on ImageNet. We discuss the relation of DSB to other simplicity
biases and consider its implications for the principle of Gaussian universality
in learning.
Comment: Source code available at https://github.com/sgoldt/dist_inc_com
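To make the "lower-order statistics first" claim concrete, the hedged sketch below trains a small network on data whose classes differ in both their means and their higher-order statistics, and tracks accuracy on "Gaussian clones" of the data (Gaussians matched to each class's mean and covariance). Everything here (data model, architecture, hyperparameters) is an illustrative assumption, not the paper's exact setup.

```python
# Hedged sketch of a "Gaussian clone" probe of the distributional
# simplicity bias (DSB); all modelling choices are ours.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 30, 8000, 2000
mu = np.zeros(d); mu[0] = 0.5          # class means differ along one axis

def sample_real(n, cls):
    """Class 0: Gaussian. Class 1: Laplace with unit variance, i.e. the
    same covariance but different higher-order statistics (kurtosis)."""
    if cls == 0:
        return rng.standard_normal((n, d)) + mu
    return rng.laplace(scale=1 / np.sqrt(2), size=(n, d)) - mu

def sample_clone(n, cls):
    """Gaussian with the same mean and covariance as the real class."""
    x = rng.standard_normal((n, d))
    return x + mu if cls == 0 else x - mu

def make_set(sampler, n):
    X = np.vstack([sampler(n // 2, 0), sampler(n // 2, 1)])
    y = np.hstack([np.zeros(n // 2), np.ones(n // 2)])
    return X, y

# a small two-layer ReLU network trained with plain mini-batch SGD
h, lr = 64, 0.05
W1 = rng.standard_normal((d, h)) / np.sqrt(d)
w2 = rng.standard_normal(h) / np.sqrt(h)

def forward(X):
    A = np.maximum(X @ W1, 0.0)
    return A, 1.0 / (1.0 + np.exp(-(A @ w2)))

def accuracy(X, y):
    return np.mean((forward(X)[1] > 0.5) == y)

X_tr, y_tr = make_set(sample_real, n_train)
X_re, y_re = make_set(sample_real, n_test)    # real test set
X_cl, y_cl = make_set(sample_clone, n_test)   # Gaussian-clone test set

for step in range(20001):
    i = rng.integers(0, n_train, size=32)
    X, y = X_tr[i], y_tr[i]
    A, p = forward(X)
    g = (p - y) / len(y)                # gradient of logistic loss in logits
    W1 -= lr * X.T @ (np.outer(g, w2) * (A > 0))
    w2 -= lr * A.T @ g
    if step % 4000 == 0:
        print(f"step {step:6d}  real acc {accuracy(X_re, y_re):.3f}  "
              f"clone acc {accuracy(X_cl, y_cl):.3f}")
# DSB predicts that real and clone accuracies track each other early on
# (the network relies on mean and covariance only) and separate later,
# once the network starts exploiting higher-order statistics.
```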
The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions
Reinforcement learning (RL) algorithms have proven transformative in a range
of domains. To tackle real-world problems, these systems often use neural
networks to learn policies directly from pixels or other high-dimensional
sensory input. By contrast, much theory of RL has focused on discrete state
spaces or worst-case analysis, and fundamental questions remain about the
dynamics of policy learning in high-dimensional settings. Here, we propose a
solvable high-dimensional model of RL that can capture a variety of learning
protocols, and derive its typical dynamics as a set of closed-form ordinary
differential equations (ODEs). We derive optimal schedules for the learning
rates and task difficulty - analogous to annealing schemes and curricula during
training in RL - and show that the model exhibits rich behaviour, including
delayed learning under sparse rewards; a variety of learning regimes depending
on reward baselines; and a speed-accuracy trade-off driven by reward
stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade
Learning Environment game "Pong" also show such a speed-accuracy trade-off in
practice. Together, these results take a step towards closing the gap between
theory and practice in high-dimensional RL.
Comment: 10 pages, 6 figures, Preprint
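As an illustration of the kind of model this describes, here is a minimal, hedged simulation of a reward-modulated perceptron learning from a teacher under sparse binary rewards; the paper's actual update rules, reward protocols, and observables may differ in detail.

```python
# Hedged sketch of a high-dimensional "RL perceptron": a REINFORCE-like
# update with a reward baseline. All parameter choices are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d = 1000
w_star = rng.standard_normal(d)        # teacher defining correct actions
w = rng.standard_normal(d) * 0.01      # student policy weights
lr, baseline = 0.5 / d, 0.3            # learning rate ~ 1/d; reward baseline

def overlap(w):
    return w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))

for t in range(200_000):
    x = rng.standard_normal(d)
    a = np.sign(w @ x)                               # binary action
    r = 1.0 if a == np.sign(w_star @ x) else 0.0     # sparse binary reward
    w += lr * (r - baseline) * a * x                 # REINFORCE-like update
    if t % 40_000 == 0:
        print(f"t={t:7d}  teacher-student overlap rho = {overlap(w):+.3f}")
# The overlap rho plays the role of an order parameter whose typical
# dynamics the paper describes with closed-form ODEs as d -> infinity.
```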
Representation mitosis in wide neural networks
Deep neural networks (DNNs) defy the classical bias-variance trade-off:
adding parameters to a DNN that interpolates its training data will typically
improve its generalization performance. Explaining the mechanism behind this
"benign overfitting" in deep networks remains an outstanding challenge. Here,
we study the last hidden layer representations of various state-of-the-art
convolutional neural networks and find evidence for an underlying mechanism
that we call "representation mitosis": if the last hidden representation is
wide enough, its neurons tend to split into groups that carry identical
information and differ from each other only by statistically independent
noise. As in mitosis, the number of such groups, or "clones",
increases linearly with the width of the layer, but only if the width is above
a critical value. We show that a key ingredient to activate mitosis is
continuing the training process until the training error is zero.
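A hedged sketch of how one might look for such clones in practice is below. `acts` (n_samples x n_neurons) and `labels` stand in for recorded last-layer activations and class labels of a trained network; the clustering-plus-probing recipe is our illustrative choice, not the paper's exact analysis.

```python
# Hedged sketch: probing for "clones" in a wide last hidden layer.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def clone_analysis(acts, labels, n_groups=4):
    acts = acts[:, acts.std(axis=0) > 0]    # drop dead (constant) neurons
    corr = np.corrcoef(acts.T)              # neuron-neuron correlations
    dist = 1.0 - np.abs(corr)               # similar neurons -> small distance
    condensed = dist[np.triu_indices_from(dist, k=1)]
    groups = fcluster(linkage(condensed, method="average"),
                      t=n_groups, criterion="maxclust")
    # mitosis signature: every group ("clone") alone supports roughly the
    # same linear-probe accuracy, i.e. the groups carry redundant information
    for g in range(1, n_groups + 1):
        sub = acts[:, groups == g]
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              sub, labels, cv=3).mean()
        print(f"clone {g}: {sub.shape[1]:4d} neurons, "
              f"probe accuracy {acc:.3f}")
```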
Quantifying lottery tickets under label noise: accuracy, calibration, and complexity
Pruning deep neural networks is a widely used strategy to alleviate the
computational burden in machine learning. Overwhelming empirical evidence
suggests that pruned models retain very high accuracy even with a tiny fraction
of parameters. However, relatively little work has gone into characterising the
small pruned networks obtained, beyond a measure of their accuracy. In this
paper, we use the sparse double descent approach to unambiguously identify and
characterise pruned models associated with classification tasks. We observe
empirically that, for a given task, iterative magnitude pruning (IMP) tends to
converge to networks of comparable sizes even when starting from full networks
with sizes ranging over orders of magnitude. We analyse the best pruned models
in a controlled experimental setup and show that their number of parameters
reflects task difficulty and that they are much better than full networks at
capturing the true conditional probability distribution of the labels. On real
data, we similarly observe that pruned models are less prone to overconfident
predictions. Our results suggest that pruned models obtained via IMP not only
have advantageous computational properties but also provide a better
representation of uncertainty in learning.
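For reference, here is a hedged sketch of the iterative magnitude pruning loop with weight rewinding that such analyses build on. `train` and `evaluate` are assumed user-supplied (and must apply the masks at every step); the 20% per-round rate is a common default, not necessarily the paper's setting.

```python
# Hedged sketch of iterative magnitude pruning (IMP) with weight rewinding.
import copy
import torch

def imp(model, train, evaluate, rounds=10, rate=0.2):
    init_state = copy.deepcopy(model.state_dict())   # snapshot for rewinding
    # prune only weight matrices/tensors, not biases
    masks = {n: torch.ones_like(p)
             for n, p in model.named_parameters() if p.dim() > 1}
    for r in range(rounds):
        model.load_state_dict(init_state)            # rewind surviving weights
        train(model, masks)                          # masked training run
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n not in masks:
                    continue
                w = (p * masks[n]).abs()
                survivors = masks[n].bool()
                k = int(rate * int(survivors.sum())) # prune `rate` of survivors
                if k == 0:
                    continue
                thresh = torch.kthvalue(w[survivors], k).values
                masks[n][w <= thresh] = 0.0          # kill smallest magnitudes
        density = float(sum(m.sum() for m in masks.values())
                        / sum(m.numel() for m in masks.values()))
        print(f"round {r}: density {density:.3f}, "
              f"accuracy {evaluate(model, masks):.3f}")
    return masks
```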
The Gaussian equivalence of generative models for learning with shallow neural networks
Understanding the impact of data structure on the computational tractability
of learning is a key challenge for the theory of neural networks. Many
theoretical works do not explicitly model training data, or assume that inputs
are drawn component-wise independently from some simple probability
distribution. Here, we go beyond this simple paradigm by studying the
performance of neural networks trained on data drawn from pre-trained
generative models. This is possible due to a Gaussian equivalence stating that
the key metrics of interest, such as the training and test errors, can be fully
captured by an appropriately chosen Gaussian model. We provide three strands of
rigorous, analytical and numerical evidence corroborating this equivalence.
First, we establish rigorous conditions for the Gaussian equivalence to hold in
the case of single-layer generative models, as well as deterministic rates for
convergence in distribution. Second, we leverage this equivalence to derive a
closed set of equations describing the generalisation performance of two widely
studied machine learning problems: two-layer neural networks trained using
one-pass stochastic gradient descent, and full-batch pre-learned features or
kernel methods. Finally, we perform experiments demonstrating how our theory
applies to deep, pre-trained generative models. These results open a viable
path to the theoretical study of machine learning models with realistic data.Comment: The accompanying code for this paper is available at
https://github.com/sgoldt/gaussian-equiv-2laye
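A hedged numerical illustration of the equivalence: the same ridge classifier is trained on inputs from a random single-layer generator and on Gaussian inputs with matched mean and covariance, and the two test errors are compared. All modelling choices (generator, teacher, task) below are ours, for illustration only.

```python
# Hedged sketch of the Gaussian-equivalence test.
import numpy as np

rng = np.random.default_rng(2)
k, d, n = 20, 200, 2000                       # latent dim, input dim, samples
A = rng.standard_normal((k, d)) / np.sqrt(k)  # weights of a random generator
w_t = rng.standard_normal(d)                  # teacher acting on input space

def gen_inputs(n):
    return np.tanh(rng.standard_normal((n, k)) @ A)   # x = tanh(z A)

# match the first two moments of the generated inputs with a Gaussian
X0 = gen_inputs(20_000)
mu, cov = X0.mean(axis=0), np.cov(X0.T)
L = np.linalg.cholesky(cov + 1e-8 * np.eye(d))

def gauss_inputs(n):
    return mu + rng.standard_normal((n, d)) @ L.T

def test_error(make_x, lam=1e-2):
    X, Xt = make_x(n), make_x(n)                       # train and test inputs
    y, yt = np.sign(X @ w_t), np.sign(Xt @ w_t)        # teacher labels
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # ridge fit
    return np.mean(np.sign(Xt @ w) != yt)

print("generative inputs:", test_error(gen_inputs))
print("matched Gaussian :", test_error(gauss_inputs))
# the equivalence predicts that the two test errors coincide as d grows
```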
Learning curves of generic features maps for realistic datasets with a teacher-student model
Teacher-student models provide a framework in which the typical-case
performance of high-dimensional supervised learning can be described in closed
form. The assumptions of Gaussian i.i.d. input data underlying the canonical
teacher-student model may, however, be perceived as too restrictive to capture
the behaviour of realistic data sets. In this paper, we introduce a Gaussian
covariate generalisation of the model where the teacher and student can act on
different spaces, generated with fixed, but generic feature maps. While still
solvable in closed form, this generalisation captures the learning
curves for a broad range of realistic data sets, thus redeeming the potential
of the teacher-student framework. Our contribution is then two-fold: First, we
prove a rigorous formula for the asymptotic training loss and generalisation
error. Second, we present a number of situations where the learning curve of
the model captures the one of a realistic data set learned with kernel
regression and classification, with out-of-the-box feature maps such as random
projections or scattering transforms, or with pre-learned ones, such as the
features learned by training multi-layer neural networks. We discuss both the
power and the limitations of the framework.
Comment: v3: NeurIPS camera-ready
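Schematically, and in our own (hedged) notation, the Gaussian covariate model described above can be written as follows, with teacher and student acting on different, jointly Gaussian feature vectors:

```latex
% Gaussian covariate model (schematic; notation ours).
\begin{aligned}
  (u, v) &\sim \mathcal{N}\!\left( 0,\,
      \begin{pmatrix} \Psi & \Phi \\ \Phi^{\top} & \Omega \end{pmatrix}
      \right), \qquad u \in \mathbb{R}^{p},\ v \in \mathbb{R}^{d}, \\
  y &= f_{0}\!\left( \frac{\theta^{\top} u}{\sqrt{p}} \right)
  \ \text{(teacher labels)}, \qquad
  \hat{y} = \hat{f}\!\left( \frac{w^{\top} v}{\sqrt{d}} \right)
  \ \text{(student prediction)}.
\end{aligned}
```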
Perspectives on adaptive dynamical systems
Adaptivity is a dynamical feature that is omnipresent in nature,
socio-economics, and technology. For example, adaptive couplings appear in
various real-world systems like the power grid, social, and neural networks,
and they form the backbone of closed-loop control strategies and machine
learning algorithms. In this article, we provide an interdisciplinary
perspective on adaptive systems. We reflect on the notion and terminology of
adaptivity in different disciplines and discuss what role adaptivity plays in
various fields. We highlight common open challenges and give perspectives on
future research directions, looking to inspire interdisciplinary approaches.
Comment: 46 pages, 9 figures